1. Introduction

Attempts  to  build  superconductor  computers  based  on  low-temperature  super- 
conductivity of  Josephson  junctions  have  come  and  gone  during  the  last  thirty 
years.  In  spite  of  both  industry  and  the  federal  government's  deep  involvement  in 
these  efforts,  no  full-fledged  superconductor  computers  have  been  built  so  far.  As 
a result,  superconductors,  once  considered  an  alternative  to  semiconductors,  have 
lost  their  appeal  to  the  general  design  community.  By  2003,  tremendous 
improvements  in  silicon-based  technology  have  allowed  semiconductor  chips  to 
have  hundreds  of  millions  of  transistors  and  clock  frequencies  exceeding  3 GHz, 
speeds  earlier  considered  the  province  of  exotic  technologies,  such  as  super- 
conductors. 

Although  superconductors  have  not  become  a mainstream  digital  technology 
(and  perhaps  it  is  fair  to  say  they  were  never  expected  to  be),  they  are  still  trying  to 
find  their  way  into  several  important  niche  areas,  one  of  which  is  extreme  (at  the 
moment  called  petaflops)  computing.  Thanks  to  the  support  by  various  U.S. 
government  agencies,  work  on  superconductor  technology  and  processor  design 
continued  behind  the  scenes,  nearly  invisible  to  silicon-based  computer  designers. 
In  this  paper,  we  consider  the  lessons  learned  from  past  superconductor  computer 
projects,  the  current  status,  and  future  directions  of  the  work  in  this  field. 


2.  IBM  Josephson  computer  technology  project 

The  first  full-scale  attempt  to  implement  superconductor  computing  was  a program 
at  IBM  in  the  1972-1983  time  frame,  using  the  first-generation  superconductor 
technology based on Josephson junction (JJ) devices in "latching" logic circuits.
The  goal  was  to  develop  the  technology  to  a level  where  a significant,  useful  high 
performance computer would be viable. The potential customers included both
"the corner bank" and the special users requiring the highest computation rates. In
the  device  domain,  fabrication  processes,  a circuit  family,  packaging  and  powering 
techniques  were  developed  and  demonstrated.  As  a demonstration  of  system 
performance,  a digital  signal  processor  called  JSP  (for  Josephson  signal  processor) 
was  designed,  based  upon  a semiconductor  version.  The  project  then  focused  on 
building  and  testing  this  JSP  until  the  program  was  terminated  in  September  of 
1983. In retrospect, the goals of this project were clock frequencies of about 1 GHz,
the  highest  that  could  be  expected  with  the  fabrication  and  circuit  technology 
available  at  that  time.  The  IBM  fabrication  technology  was  improved  upon  during 
the  MITI  project  in  Japan  in  the  1980's,  particularly  with  the  change  from  niobium- 
lead  alloy  junctions  and  lead-alloy  wiring  to  all-refractory  niobium-based 
fabrication.  Fujitsu,  NEC,  Hitachi,  as  well  as  Japan's  ETL  were  funded  under  this 
project,  but  despite  small  processor  and  memory  demonstrations,  no  full-fledged 
supercomputing  effort  was  mounted. 


3.  Hybrid  technology  multi-threaded  petaflops  computer  project 

The  hybrid  technology  multi-threaded  (HTMT)  architecture  petaflops  project 
(1996-2000)1 presented another chance to resurrect superconductor computing
almost 15 years after the end of the IBM-led project in the US and the MITI project in
Japan. 

New  hopes  for  success  were  based  on  several  breakthroughs  in  circuit  design 
such as the development of a new rapid single flux quantum (RSFQ)2 logic using the
magnetic  flux  quantization  properties  of  superconducting  rings  with  Josephson 
junctions,  as  well  as  the  development  of  niobium-trilayer  technological  processes 
in  the  U.S.  This  RSFQ  logic  represented  a significant  improvement  over  previous 
superconductor  logic  families  employed  by  IBM  and  the  MITI  program  in  Japan. 
Those  earlier  programs  were  based  on  voltage-latching  gates  that  dissipated  more 
power  and  were  limited  to  several  GHz  by  the  need  to  actively  reset  each  gate 
every  clock  cycle.  In  addition,  latching  logic  required  multi-phase  rf  power 
systems,  while  RSFQ  logic  uses  only  dc  power. 

RSFQ  circuits  operate  by  the  creation,  elimination,  and  propagation  of  single 
magnetic flux quanta (Φ0 = h/2e = 2.07 mV ps) in small monolithic thin-film
inductive  loops  (a  few  picohenries)  containing  Josephson  tunnel  junctions. 
Josephson  junctions  can  switch  and  perform  logic  functions  in  a few  picoseconds. 
Bits  are  stored  as  a persistent  supercurrent  in  the  inductor,  at  zero  voltage.  Bits 
propagate  from  one  gate  to  the  next  as  millivolt  picosecond  SFQ  pulses,  not 
voltage  levels.  Matched  strip  and  microstrip  lines  are  used  to  propagate  signals 
beyond  the  adjacent  gate. 
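The value of the flux quantum quoted above follows directly from the physical constants. The short sketch below recomputes it and, as an illustration only, estimates the mean amplitude of an SFQ pulse of a given width. The 2 ps pulse width is an assumed example value, not a figure taken from this paper.

```python
# Exact SI values of the constants (fixed by the 2019 SI redefinition).
PLANCK_H = 6.62607015e-34             # J s
ELEMENTARY_CHARGE = 1.602176634e-19   # C


def flux_quantum_wb() -> float:
    """Magnetic flux quantum Phi0 = h / (2e), in webers (V s)."""
    return PLANCK_H / (2 * ELEMENTARY_CHARGE)


def mean_pulse_voltage_mv(pulse_width_ps: float) -> float:
    """Mean voltage of a pulse whose time integral equals Phi0.

    The pulse width is a free parameter chosen by the caller (assumption).
    """
    phi0_mv_ps = flux_quantum_wb() * 1e15   # 1 V s = 1e15 mV ps
    return phi0_mv_ps / pulse_width_ps


print(f"Phi0 = {flux_quantum_wb() * 1e15:.3f} mV ps")
print(f"mean amplitude of a 2 ps pulse = {mean_pulse_voltage_mv(2.0):.2f} mV")
```

A pulse a few picoseconds wide therefore has an amplitude on the order of a millivolt, which is why SFQ signals are described as millivolt picosecond pulses.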

These  technological  improvements  coincided  with  the  decision  by  policy 
makers  at  U.S.  federal  agencies  that  computer  processing  capabilities  reaching  and 
exceeding one petaflop (10^15 floating-point operations per second) were in the
national  interest.  It  was  recognized  that  massively  parallel  systems  designed  with 
thousands of silicon-based processors could reach this target no earlier than 2010.
As  suggested  by  the  authors  of  the  HTMT  concept  and  later  confirmed  by  the 
results  of  the  HTMT  studies,  with  adequate  investments  into  technology  and  design 
tools,  superconductor  processors  coupled  with  other  new  technologies  and 
architectural  concepts  could  provide  a "shortcut"  into  this  petaflops  computing 
territory.  Thus,  in  1996  superconductor  processor  design  got  a new  life  with  a new 
focus  on  extreme,  petaflops  level  computing,  where  it  could  outperform,  in  terms 
of  speed  and  power,  semiconductor  processors. 

The  results  of  these  HTMT  design  studies  related  to  the  area  of  superconductor 
computing  have  been  the  following: 

• the  development  of  a parallel  superconductor  processor  element  (Spell) 
architecture3  with  two-level  simultaneous  multithreading,  capable  of 
addressing  the  huge  disparities  in  cycle  times  and  access  latency  between 
superconductor  Spell  processors  and  silicon-based  memory; 

• the  feasibility  analysis  of  possible  implementation  of  a multiprocessor 
superconductor  sub-subsystem4  consisting  of  4,096  ~50  GHz  Spell 
processors  with  their  local  small  superconductor  memory  interconnected 
by  a multi-stage  superconductor  switch,  all  implemented  in  the  (still  to  be 
developed) 0.8 µm, 20 kA/cm2 superconductor technology;

• efficient  integration  of  the  superconductor  sub-system  with  other 
components  of  a hierarchical  HTMT  system; 

• development of a new 4 kA/cm2 critical current density, 1.75 µm Nb/Al-
AlOx/Nb trilayer superconductor process5 at the former TRW, Inc. (now
Northrop Grumman Space Technology - NGST). Since that time, NGST
has advanced to an 8 kA/cm2, 1.25 µm process and is collaborating with
JPL to initiate a 20 kA/cm2, 0.8 µm process.
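To put the sub-system size quoted in the list above in perspective, the following back-of-the-envelope sketch relates the processor count and clock rate to the petaflops target. The resulting operations-per-cycle figure is our own arithmetic under the assumption that every processor is kept fully busy; it is not a number reported by the HTMT studies.

```python
# Figures taken from the text above.
PROCESSORS = 4096        # Spell processors in the studied sub-system
CLOCK_HZ = 50e9          # ~50 GHz clock per processor
TARGET_FLOPS = 1e15      # one petaflops

# Total processor cycles available per second across the sub-system.
aggregate_cycles_per_s = PROCESSORS * CLOCK_HZ

# Floating-point operations each processor would have to complete per
# cycle to reach the target, assuming perfect utilization (assumption).
flops_per_cycle_needed = TARGET_FLOPS / aggregate_cycles_per_s

print(f"aggregate cycles/s: {aggregate_cycles_per_s:.3e}")
print(f"FP operations per processor per cycle: {flops_per_cycle_needed:.2f}")
```

Under these assumptions each processor would need to sustain roughly five floating-point operations per cycle.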

In  parallel  with  these  developments  at  SUNY-Stony  Brook  and  TRW,  Prof. 
Van  Duzer  and  his  team  at  UC  Berkeley  worked  on  an  RSFQ  chip-level  interface 
for a 64-Kbit CMOS RAM with a projected read access time of ~600 ps at 4 K.6
The  relatively  high  power  dissipation  for  these  Josephson-CMOS  hybrid  memory 
chips  remains  a concern,  so  the  lack  of  fast,  dense,  and  low  power  superconductor 
memory  remains  one  of  the  longstanding  problems  to  be  overcome  before 
petaflops  computing  with  superconductor  processors  becomes  feasible. 

Outside  the  HTMT  project,  another  project  known  as  the  Superconductive 
Crossbar was initiated by the U.S. Dept. of Defense in the 1990s to demonstrate the
feasibility  of  using  superconductor  technology  for  very  high  data  rate 
communication  among  processors  as  well  as  shared  memory  in  parallel 
multiprocessor  systems.  The  architecture  for  such  a switch  was  a 128x128  self- 
routing crossbar  capable  of  processing  data  streams  at  a rate  of  2.5  Gbit  per  second 
per  channel. 

Although  the  results  of  the  four  years  of  the  HTMT  studies  were  optimistic 
about  the  potential  use  of  superconductor  technology  for  supercomputing,  they  also 
highlighted  the  huge  gap  between  the  current  state  of  superconductor  circuit  design 
and  VLSI  fabrication,  and  the  high  complexity  and  reliability  required  from 
superconductor chips for such a petaflops system. By 2000, the most complex
superconductor LSI chips had only a few thousand junctions, capable of doing
only  some  non-programmable,  built-in  signal  processing  functions.  By 
comparison,  the  HTMT  Spell  processors  were  expected  to  have  up  to  400K  gates, 
or  several  million  junctions  per  chip.  Although  other  new  HTMT  technologies 
were  also  in  experimental  stages  of  development,  none  of  them  created  more 
controversy  than  superconductor  processors  among  reviewers  of  the  project.  It  was 
perhaps  not  surprising  that  sponsors  were  reluctant  to  commit  several  hundred 
million  dollars  for  the  full  development,  design,  and  fabrication  of  an  HTMT 
prototype  system  with  several  thousands  of  superconductor  processors  until  at  least 
a small  superconductor  microprocessor  prototype  had  been  demonstrated. 


4.  8-bit  FLUX-1  microprocessor 

In  order  to  address  these  issues  and  get  a practical  handle  on  real-world 
superconductor  VLSI  microprocessor  chip  design,  the  collaboration  between 
SUNY  Stony  Brook,  TRW,  and  JPL  (NASA)  to  design  and  demonstrate  a FLUX-1 
microprocessor  chip  started  in  2000.7  A FLUX-1  chip  was  expected  to  be  a 
superconductor  technology  "driver",  without  any  plans  for  using  it  in  future 
superconductor systems. Its design was based on NGST's new 4 kA/cm2 1.75 µm
Nb-trilayer  superconductor  fabrication  process.5 

Figure 1 shows the block diagram of the 20 GHz 8-bit FLUX-1 microprocessor,
the most ambitious chip implemented in RSFQ technology to date. It
includes  a small  instruction  memory  of  16  30-bit  instructions  with  the  embedded 
program  counter  and  instruction  fetch  logic,  the  branch  unit,  the  instruction  register 
and  dual  decode/issue  logic,  eight  bit-stream  arithmetic  logic  units  (ALUs) 
interleaved  with  eight  8-bit  general-purpose  integer  registers  (REG0-REG7),  two 
8-bit  20  GHz  I/O  ports,  the  clock  controller,  and  built-in  scan  path  circuitry. 
FLUX-1 has a partitioned, deeply pipelined, synchronous dual-op long-instruction-
word architecture with an instruction set of ~25 control, integer arithmetic, and
logical operations. Two instructions can be issued and completed during each 50
ps cycle.8,9

Figure 1. Block diagram of a FLUX-1 microprocessor.

The  functional  correctness  of  the  architecture  was  verified  with  a cycle- 
accurate  FLUX-1  simulation  model.  During  the  physical  implementation  of  the 
chip,  the  chip  layout  was  done  manually  due  to  the  lack  of  synthesis  and  automatic 
placement  and  routing  tools  tuned  for  superconductor  technology. 

The  first  wafers  with  FLUX-1  chips  were  delivered  in  August  of  2001. 
Because  of  a very  high  6.9  A bias  current  supplied  to  these  chips,  the  power  pads 
on  the  FLUX-1  chip  carrier  had  to  be  redesigned  to  avoid  overheating.  The  large 
chip  size  of  140  mm2  made  it  very  difficult  to  find  dies  without  any  defects. 
Nevertheless,  the  most  serious  problem  was  the  very  high  bit  error  rate  (BER)  in 
the  FLUX-1  scan  path  circuits,  which  made  further  chip  testing  impossible.  It  was 
concluded  that  this  high  BER  was  probably  due  to  errors  in  the  circuit-level  design 
of  transmission  line  drivers  and  receivers.  As  a result,  both  the  FLUX-1  gate 
library  and  the  chip  layout  were  improved.  First  wafers  of  the  revised  FLUX-1R 
chip,  with  the  same  architecture  but  different  circuit-level  design,  were  fabricated 
in the summer of 2002.10 FLUX-1R was built using a library of only 10 SFQ gates,
which were designed to be interconnected with passive matched transmission
lines.11,12 The FLUX-1R chip shown in Fig. 2 has 21% less chip area, 33% less bias
current,  4%  fewer  Josephson  junctions,  and  other  improvements  over  the  first 
version  of  the  chip.  The  total  power  dissipation  is  ~10  mW. 

The  complete  FLUX-1R  stripline-connected  gate  library  has  been  verified  and 
the  correct  operation  of  several  FLUX-1R  microprocessor  blocks  has  been 
demonstrated,  including  the  13-stage  scan  path  within  the  first  instruction  decoder 
on  the  FLUX-1R  chip  itself,  a significant  advance  over  the  first  design.  Measured 
low-frequency  gate  bias  margins  (see  Table  1)  were  large,  reproducible  between 
chips  from  different  wafers,  and  in  good  agreement  with  simulations.  Equally 
important,  all  gate  bias  margins  strongly  overlap. 
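The percentage margins in Table 1 appear to be the margin in millivolts divided by the bias voltage. This reading is inferred from the numbers themselves rather than stated in the text, so the sketch below should be taken as a plausibility check under that assumption. It recomputes each percentage from the tabulated values.

```python
# (V bias in mV, margin in mV, published margin in %) from Table 1.
TABLE_1 = {
    "XOR":       (1.74, 0.75, 43),
    "SPLIT":     (1.63, 0.70, 43),
    "DFF":       (1.79, 0.70, 39),
    "INV":       (1.63, 0.60, 37),
    "MERGER":    (1.79, 0.65, 36),
    "D2FF":      (1.68, 0.55, 33),
    "INPUT":     (1.75, 0.55, 31),
    "AND":       (1.79, 0.50, 28),
    "OUTPUT":    (1.82, 0.50, 27),
    "NDRO":      (1.73, 0.45, 26),
    "All gates": (1.73, 0.45, 26),
}


def margin_percent(v_bias_mv: float, margin_mv: float) -> float:
    """Margin relative to the bias voltage (assumed definition)."""
    return 100.0 * margin_mv / v_bias_mv


def recomputed_margins() -> dict[str, int]:
    """Rounded percentage margin for every gate in the table."""
    return {gate: round(margin_percent(v, m)) for gate, (v, m, _) in TABLE_1.items()}


for gate, pct in recomputed_margins().items():
    print(f"{gate:10s} {pct:3d}%  (published {TABLE_1[gate][2]}%)")
```

Every recomputed value rounds to the published percentage, which supports the assumed definition.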


Figure 2. FLUX-1R chip photomicrograph. There are 63,107 Josephson junctions
on a 10.35 x 10.65 mm2 die.

A number of larger breakout structures were also tested successfully, although
operating  margins  were  smaller.  To  date,  the  largest  circuit  block  successfully 
tested  with  reproducible  margins  across  different  chips  is  a 1-bit  ALU  register 
block  that  is  a key  element  of  the  FLUX-1  computational  engine,  where  this  block 
is  replicated  64  times  to  create  a parallel  bit-stream  data  processing  data  path.  This 
circuit  is  a microcosm  of  the  FLUX-1R  chip  and  makes  liberal  use  of  passive 
interconnects,  including  crossovers.  Further  testing  of  the  FLUX-1R  chip  is 
planned.10,13

Among  other  results  of  the  FLUX-1  project,  the  most  important  ones  are: 

• the  successful  demonstration  of  inter-chip,  flux-quantum,  serial-data 
transmission  up  to  60  Gb/s  through  a passive  substrate,14  and 

• development of a new advanced 8 kA/cm2, 1.2 µm junction fabrication
process,15  as  the  next  step  in  the  evolution  towards  higher  speed  and 
increased  circuit  density. 

The  new  process  is  based  on  NGST's  4 kA/cm2  Nb  process  used  previously  for  the 
FLUX-1R  chip  fabrication,  but  several  new  process  steps  have  been  developed  to 
improve yield and reproducibility. Minimum junction size and line pitch in the 8
kA/cm2 process are 1.2 µm and 2.6 µm, respectively. Critical current spreads are
typically 1.5% (1σ) for the arrays of 1.2 µm junctions, comparable to the best
spreads  achieved  in  the  4 kA/cm2  process.  The  new  process  has  been  used  to 
demonstrate  a 300  GHz  static  digital  divider  and  to  fabricate  complex  digital 
circuits  of  28,000  junctions. 


Gate         V bias (mV)   Margin (mV)   Margin (%)
XOR          1.74          0.75          43%
SPLIT        1.63          0.70          43%
DFF          1.79          0.70          39%
INV          1.63          0.60          37%
MERGER       1.79          0.65          36%
D2FF         1.68          0.55          33%
INPUT        1.75          0.55          31%
AND          1.79          0.50          28%
OUTPUT       1.82          0.50          27%
NDRO         1.73          0.45          26%
All gates    1.73          0.45          26%

Table 1. FLUX-1R gate library bias margins.


5. FLUX-2: 32-bit floating-point vector multiplier chipset

In 2002, in parallel with the testing of the FLUX-1R microprocessor chip at NGST,
work  started  on  a new  FLUX-2  project,  a 32-bit  floating-point  (FP)  vector 
multiplier  unit  chipset  with  a target  clock  frequency  of  25  GHz,  by  the  same  team 
of  SUNY-Stony  Brook,  NGST,  and  JPL.  The  primary  goal  of  the  FLUX-2  project 
is  to  shed  light  on  how  complex  processing  units  operating  at  25+  GHz  clock 
frequencies  and  consisting  of  multiple  superconductor  chips  can  be  designed, 
fabricated,  and  packaged  on  a single  MCM  carrier.  This  is  a crucial  step  for 
superconductor  computer  design  because  the  low  gate  density  of  the  current,  and 
perhaps  even  next  generation,  superconductor  chips  will  not  allow  a 32/64-bit 
processor  to  be  implemented  on  a single  chip. 

Below  we  discuss  in  detail  the  major  results  of  the  work  in  progress  done  by 
the  SUNY-Stony  Brook  team  that  has  been  developing  the  architecture  and  gate- 
level  design  of  a dual-chip  32-bit  FP  vector  multiplier  of  the  FLUX-2  chipset. 

Our  goal  is  to  design  a dual-chip  multiplier  capable  of  calculating  one  32-bit 
floating-point  result  per  40  ps  cycle,  while  processing  two  input  streams  of  data 
encoded  in  the  IEEE  754  single  precision  format  from  vector  register  memory  on  a 
separate  chip.  All  data  transfers  between  chips  placed  on  a multi-chip  module  are 
to  be  done  over  superconductor  Nb  transmission  lines  at  the  same  clock  rate  used 
to  process  and  read/write  data  inside  the  chips. 

The  key  problems  to  be  solved  are  the  following: 

• partitioning  of  the  32-bit  FP  pipelined  vector  multiply  unit  into  multiple 
chips  of  ~1  cm2  size  each,  with  a gate  count  (~10K)  and  the  number  of  I/O 
pads of each chip within the capabilities of NGST's 1.2 µm chip
fabrication  and  solder  reflow  packaging  processes; 

• tolerating  significant  clock  and  data  skew  both  between  the  chips  on  an 
MCM  carrier  and  between  blocks  within  each  chip; 

• processing  of  vector  data  from  memory  at  the  25  GHz  rate. 

The  top-level  functional  block  diagram  for  a FLUX-2  floating  point  multiplier 
is  shown  in  Fig.  3.  A 32-bit  FP  multiplier  consists  of  an  unsigned  24-bit  integer 
multiplier  for  significands  (mantissas),  8-bit  adder  for  exponents,  sign  calculation 
by  XORing  the  signs  of  two  operands,  plus  support  circuitry  to  deal  with  the 
exponents and special values (0, +inf, -inf, etc.). The significand is in the [1, 2)
range, so the product falls in the [1, 4) range. Thus, normalizing as well as
incrementing a tentative exponent may be required. The IEEE floating point
multiplier has four different rounding schemes,16,17 necessitating a second set of
"normalize"  and  "adjusting  exponent"  steps  after  the  "round"  step,18  because  for 
some  of  the  schemes  the  result  after  rounding  needs  to  be  shifted.  In  our  case,  we 
implemented  only  truncation  rounding  which  does  not  require  that  extra  step. 
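The data flow just described can be illustrated with a small software model. The sketch below is a behavioral illustration only, not the FLUX-2 gate-level design: it multiplies two normal IEEE 754 single-precision values using a sign XOR, an exponent add, a 24-bit unsigned significand multiply, a single normalization step, and truncation. Zeros, subnormals, infinities, NaNs, and exponent overflow or underflow are deliberately not handled, and all function names are ours.

```python
import struct


def float_to_bits(x: float) -> int:
    """Round x to binary32 and return its 32-bit pattern."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def bits_to_float(bits: int) -> float:
    """Interpret a 32-bit pattern as a binary32 value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def fp32_multiply_truncate(a_bits: int, b_bits: int) -> int:
    """Multiply two normal binary32 numbers, truncating the result."""
    sign = ((a_bits >> 31) ^ (b_bits >> 31)) & 1

    exp_a = (a_bits >> 23) & 0xFF
    exp_b = (b_bits >> 23) & 0xFF
    exponent = exp_a + exp_b - 127            # tentative biased exponent

    # Restore the hidden leading 1: 24-bit significands in [1, 2).
    sig_a = (a_bits & 0x7FFFFF) | 0x800000
    sig_b = (b_bits & 0x7FFFFF) | 0x800000

    product = sig_a * sig_b                   # 48 bits, value in [1, 4)

    if product & (1 << 47):                   # product in [2, 4): normalize
        product >>= 1
        exponent += 1                         # increment tentative exponent

    # Leading 1 is now at bit 46; keep the next 23 bits and drop the rest.
    fraction = (product >> 23) & 0x7FFFFF
    return (sign << 31) | ((exponent & 0xFF) << 23) | fraction


result = fp32_multiply_truncate(float_to_bits(1.5), float_to_bits(2.5))
print(bits_to_float(result))
```

Because truncation never rounds the significand up, the normalized result cannot overflow back into the [2, 4) range, so no second normalization step is needed in this model.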

The  slowest  and  the  most  challenging  part  of  a multiplier  is  multiplication  of 
significands.  Indeed,  the  largest  part  of  the  critical  path  delay  (more  than  90%  in 
our  case)  comes  from  the  24-bit  unsigned  integer  multiplication  of  significands. 
Figure 3. Functional block diagram of a FLUX-2 floating-point multiplier.


Many  different  algorithms  have  been  proposed  for  fast  integer  multiplication.20 
Logically,  it  can  be  separated  into  three  stages:  partial  product  (PP)  generation,  PP 
reduction,  and  final  summation. 
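The three stages can be sketched as a simple functional model. This is an illustration under our own assumptions: it uses plain AND-array partial products and a serial chain of [3:2] carry-save steps, and it says nothing about pipeline depth, topology, or the compressors used in the actual design.

```python
def generate_partial_products(a: int, b: int, width: int = 24) -> list[int]:
    """Stage 1: one partial product per multiplier bit (a AND b_i, shifted)."""
    return [(a if (b >> i) & 1 else 0) << i for i in range(width)]


def carry_save_add(x: int, y: int, z: int) -> tuple[int, int]:
    """[3:2] compression: three operands in, a sum word and a carry word out."""
    sum_word = x ^ y ^ z
    carry_word = ((x & y) | (x & z) | (y & z)) << 1
    return sum_word, carry_word


def reduce_partial_products(pps: list[int]) -> tuple[int, int]:
    """Stage 2: compress the partial products down to two operands."""
    pps = list(pps) + [0] * max(0, 2 - len(pps))   # pad so two operands exist
    while len(pps) > 2:
        x, y, z = pps.pop(), pps.pop(), pps.pop()
        pps.extend(carry_save_add(x, y, z))
    return pps[0], pps[1]


def multiply(a: int, b: int, width: int = 24) -> int:
    """Stage 3: final carry-propagate summation of the two operands."""
    sum_word, carry_word = reduce_partial_products(
        generate_partial_products(a, b, width)
    )
    return sum_word + carry_word


print(hex(multiply(0xFFFFFF, 0xFFFFFF)))
```

Only the last step needs a carry-propagate adder, which is why the choice of reduction topology and of the final adder can be made independently.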

While  Booth's  algorithm  for  PP  generation  together  with  the  Wallace/binary 
tree for the PP reduction can be found in most CMOS high-speed multipliers,20,21
this  combination  does  not  provide  the  best  solution  in  the  case  of  an  RSFQ 
multiplier  because  the  cost  of  broadcasting  signals  with  RSFQ  splitters  is  high  (in 
terms  of  latency  and  data  skew).  Further,  the  relatively  small  number  of  metal 
layers  available  with  the  present  superconductor  process  makes  it  very  difficult  to 
lay  out  the  Wallace/binary  trees  without  wasting  too  much  valuable  chip  area. 

For  the  case  of  24-bit  multiplication  implemented  in  RSFQ  logic,  our  analysis 
of  the  Booth  2 encoding  with  different  PP  reduction  topologies  showed  that  the  use 
of  Booth  2 can  give  only  insignificant  performance  gains,  decreasing  the  multiply 
latency by one cycle at most for some of the topologies. At the same time, the
savings in area due to the decreased number of PPs with the Booth 2 encoding (13
with Booth 2 vs. 24 without it) are offset by the increase in area needed to lay out
the significantly larger number of wires required to broadcast signals. This
analysis  precluded  the  use  of  any  Booth  encoding  in  the  PP  generation  stage. 
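The count of 13 partial products for Booth 2 can be reproduced with a textbook radix-4 Booth recoding of an unsigned 24-bit multiplier. The sketch below is a generic illustration of that recoding, not the encoder evaluated in our analysis. The extra (13th) digit is needed because the top bit of an unsigned operand may be set.

```python
def booth2_digits(multiplier: int, width: int = 24) -> list[int]:
    """Radix-4 Booth recoding of an unsigned multiplier.

    Returns digits d_i in {-2, -1, 0, 1, 2} such that
    multiplier == sum(d_i * 4**i).
    """
    n_digits = width // 2 + 1          # extra digit absorbs an unsigned top bit
    padded = multiplier << 1           # appends the implicit bit b[-1] = 0
    digits = []
    for i in range(n_digits):
        triplet = (padded >> (2 * i)) & 0b111   # b[2i+1], b[2i], b[2i-1]
        b_low = triplet & 1                     # b[2i-1]
        b_mid = (triplet >> 1) & 1              # b[2i]
        b_high = (triplet >> 2) & 1             # b[2i+1]
        digits.append(b_low + b_mid - 2 * b_high)
    return digits


digits = booth2_digits(0xFFFFFF)
print(len(digits), digits)
```

Each digit selects one partial product (0, ±1, or ±2 times the multiplicand), so a 24-bit multiplier yields 13 partial products instead of 24.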

In  theory,  a binary  tree  used  during  the  PP  reduction  phase  of  multiplication  of 
significands  gives  the  best  performance  in  terms  of  the  number  of  pipeline  stages, 
but  performance  degrades  when  it  is  implemented  in  RSFQ  logic.  The  superior 
performance  of  a binary  tree  (or  any  tree  topology  in  general)  comes  at  a heavy 
price.  Uneven  concentration  of  wires  in  different  pipeline  stages  complicates  the 
physical  layout  and  produces  wasted  area.  The  less  compact  layout  adds  additional 
pipeline  stages  in  the  critical  path.  We  chose  the  (4-4-6-10)  high-order  array 
topology  for  the  PP  reduction  because  it  is  easier  to  lay  out,  and  it  gives  the  same 
performance  as  the  binary  tree. 

The  advantage  of  a high-order  array  is  that  the  number  of  connections  between 
sub-arrays is constant. The two-dimensional structure of such an array is easy to lay
out, especially given that all rows are of about the same width, i.e., have
the same number of compressors. The layout is very regular and implemented by
tiling of modules. There are a total of 8 different module types, each with three [4:2]
compressors  and  some  with  the  PP  encoding  logic  and  signal  propagation  latches. 
By  breaking  long  wires  between  rows  (which  combine  results  from  sub-arrays)  and 
aligning  rows  in  a specific  manner,  we  developed  a systolic-like  structure  where 
each  module  is  only  connected  to  its  neighbors.  The  modules  are  connected  in  a 
corner-based clock network. A clock distribution network with very small clock
skew  is  implemented  inside  each  module. 
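For readers unfamiliar with [4:2] compressors, the sketch below shows the textbook construction from two full adders and checks its defining identity exhaustively. It is a logic-level illustration only. Whether the RSFQ modules are built this way is not stated here, so the structure should be read as an assumption.

```python
from itertools import product


def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (sum, carry) of three bits; also works bitwise on whole words."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_4_2(x1: int, x2: int, x3: int, x4: int, cin: int):
    """[4:2] compressor: x1+x2+x3+x4+cin == total + 2*(carry + cout)."""
    s1, cout = full_adder(x1, x2, x3)       # cout does not depend on cin
    total, carry = full_adder(s1, x4, cin)
    return total, carry, cout


def compress_words(w1: int, w2: int, w3: int, w4: int) -> tuple[int, int]:
    """Reduce four operand words to two words with the same sum."""
    s1, cout = full_adder(w1, w2, w3)
    cin = cout << 1                          # cout of bit i feeds cin of bit i+1
    total, carry = full_adder(s1, w4, cin)
    return total, carry << 1


def check_compressor() -> bool:
    """Exhaustively verify the single-bit identity for all 32 input cases."""
    for bits in product((0, 1), repeat=5):
        total, carry, cout = compressor_4_2(*bits)
        if sum(bits) != total + 2 * (carry + cout):
            return False
    return True


print(check_compressor())
print(sum(compress_words(11, 22, 33, 44)))
```

Because the horizontal carry (cout) is independent of the incoming carry (cin), a row of such compressors has no ripple along its width, which is what makes regular tiling of identical modules attractive.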

Truncation  rounding  is  performed  in  parallel  with  the  reduction  and  the 
encoding  stages  by  adding  the  least  significant  bits  coming  from  the  high-order 
array  in  a ripple-carry  fashion.  The  sum  bits  are  dropped,  while  a carry  value  is 
propagated  to  the  final  adder.  In  the  final  stage,  the  Kogge-Stone  algorithm22  is 
executed  to  add  two  32-bit  numbers.  The  exponent  adder  is  not  on  the  critical  path, 
whereas  the  exponent  adjustment  is  on  it.  In  order  to  reduce  overall  latency, 
adjusting  of  an  exponent  is  implemented  by  calculating  two  sums  simultaneously, 
the  sum  of  two  exponents  and  the  sum  of  two  exponents  plus  one,  with  each  sum 
calculated  by  a separate  ripple-carry  adder.  If  the  result  from  the  final  adder  has  to 
be  normalized,  then  the  second  sum  is  chosen  to  be  a new  exponent  value.  The 
first  sum  is  selected  otherwise. 
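The final addition and the exponent selection can be sketched as follows. The adder is a generic word-level model of a Kogge-Stone parallel-prefix adder, and the exponent function mirrors the "compute both sums, then select" scheme described above. The subtraction of the IEEE 754 bias (127) in the exponent sums is our assumption, since the text does not say where the bias is removed.

```python
def kogge_stone_add(a: int, b: int, width: int = 32, carry_in: int = 0):
    """Kogge-Stone parallel-prefix addition; returns (sum, carry_out)."""
    mask = (1 << width) - 1
    propagate_xor = (a ^ b) & mask      # per-bit propagate, reused for the sum
    generate = (a & b) & mask           # per-bit generate
    propagate = propagate_xor
    generate |= propagate & carry_in    # fold carry_in (0 or 1) into bit 0

    distance = 1
    while distance < width:             # log2(width) prefix levels
        generate = (generate | (propagate & (generate << distance))) & mask
        propagate = propagate & (propagate << distance)
        distance <<= 1

    carries = ((generate << 1) | carry_in) & mask   # carry into each bit
    total = propagate_xor ^ carries
    carry_out = (generate >> (width - 1)) & 1
    return total, carry_out


def select_exponent(exp_a: int, exp_b: int, needs_normalize: bool,
                    bias: int = 127) -> int:
    """Compute both candidate exponents, then pick one.

    In hardware the two sums come from separate adders working in parallel.
    """
    sum_plain = exp_a + exp_b - bias
    sum_plus_one = exp_a + exp_b - bias + 1
    return sum_plus_one if needs_normalize else sum_plain


print(kogge_stone_add(0xFFFFFFFF, 1))
print(select_exponent(128, 128, needs_normalize=True))
```

For 32-bit operands the prefix loop runs five times, so the carry network depth grows logarithmically with the word width rather than linearly as in a ripple-carry adder.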

Table  2 shows  the  estimated  latency,  area,  and  complexity  for  different  blocks 
of  the  FLUX-2  dual  chip  multiplier.  All  physical  characteristics  of  the  chips  were 
estimated  for  the  gate-level  multiplier  schematics  developed  at  SUNY-Stony 
Brook, using NGST's jl 10D FLUX-1 gate library, with gates as well as the wire
pitch scaled down to those in NGST's new jl 10E 8 kA/cm2, 1.2 µm process.


Block                                        Latency (40 ps cycles)                JJ count
Encoding and PP reduction                    18                                    ~105K
Rounding                                     not on the critical path              ~1.2K
Final adder                                  8-10                                  ~17K
Exponent adder                               not on the critical path              ~8.3K
Normalize significand and adjust exponent    1                                     ~1.5K
Dual-chip 32-bit FP multiplier total         27-29 (w/o chip-to-chip wire delay)   ~135K

Blocks are split between FP multiplier chip 1 and FP multiplier chip 2.
Area, mm2: ~6.5 x 6, ~3 x 3, ~10 x 10.

Table 2. Estimated design parameters for a 25 GHz 32-bit FP dual-chip multiplier.
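The entries of Table 2 can be combined into a few derived figures. The sketch below is simple arithmetic on the tabulated estimates (approximate values are taken at face value). It sums the critical-path stages, converts the latency to nanoseconds, derives the peak result rate from the 40 ps cycle, and adds up the per-block junction counts.

```python
CYCLE_PS = 40

# (min, max) latency in cycles for the blocks on the critical path.
CRITICAL_PATH_CYCLES = {
    "encoding and PP reduction": (18, 18),
    "final adder": (8, 10),
    "normalize significand and adjust exponent": (1, 1),
}

# Approximate Josephson junction counts per block.
JJ_COUNT = {
    "encoding and PP reduction": 105_000,
    "rounding": 1_200,
    "final adder": 17_000,
    "exponent adder": 8_300,
    "normalize significand and adjust exponent": 1_500,
}

min_cycles = sum(lo for lo, _ in CRITICAL_PATH_CYCLES.values())
max_cycles = sum(hi for _, hi in CRITICAL_PATH_CYCLES.values())

# Latency excludes chip-to-chip wire delay, as in the table.
latency_ns = (min_cycles * CYCLE_PS / 1000, max_cycles * CYCLE_PS / 1000)

# One result per cycle once the pipeline is full.
results_per_second = 1e12 / CYCLE_PS

total_jj = sum(JJ_COUNT.values())

print(f"critical path: {min_cycles}-{max_cycles} cycles")
print(f"latency: {latency_ns[0]:.2f}-{latency_ns[1]:.2f} ns")
print(f"peak rate: {results_per_second:.1e} results/s")
print(f"sum of block JJ counts: {total_jj}")
```

The block counts add up to about 133K junctions, consistent with the rounded total of ~135K, and the pipeline latency is slightly over one nanosecond at a peak rate of 25 billion results per second.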
Despite the progress that has been made, important technology challenges remain
to  be  addressed  before  the  first  practical  superconductor  processor  can  be 
demonstrated,  namely: 

• Nb  VLSI  fabrication  process  needs  further  improvements  to  provide  a 
higher  clock  frequency,  greater  gate  density,  and  greater  reliability; 

• At  least  64  Kb,  0.1  ns  low-power  cryogenic  RAM  chips  have  to  be 
developed; 

• Circuit  designs  that  enable  lower  power  dissipation,  lower  bias  current  per 
chip,  and  higher  clock  rates  are  needed; 

• Reliable  MCM  packaging  and  parallel  data  chip-to-chip  communication 
techniques  are  to  be  demonstrated; 

• Thermally  and  electrically  efficient,  high-data-rate  cryogenic-to-ambient 
I/O  technology  has  to  be  developed; 

• Efficient  CAD  tools  to  design  and  verify  RSFQ  circuits  operating  at  25+ 
GHz  clock  frequencies  are  required. 


In  the  architecture  domain,  the  realities  of  the  current  and  future  Nb  fabrication 
processes  (especially  the  limited  gate  density  of  Nb  VLSI  logic  and  memory  chips) 
and  the  peculiarities  of  the  RSFQ  logic  must  be  fully  addressed  in  the  design  of 
RSFQ processors and their blocks in order to utilize the huge potential of
superconductors  for  petaflops  computing. 